Big Data Projects

Map/Reduce:

Hadoop overview (ecosystem): Hadoop includes Map/Reduce, the distributed file system HDFS, and the distributed database HBase.

Forward index vs. inverted index. Forward index: in a raw document collection, the document ID is generally the index key and the document's content is the record. It views words from the document's perspective: for each document (identified by its ID), which words it contains, how many times each word occurs (term frequency), and where each occurrence is (offset from the beginning of the document). Inverted index: views documents from the word's perspective: for each word, which documents it appears in (document IDs), how many times it appears in each of those documents (term frequency), and at which positions (offset from the beginning of that document). In other words, the word becomes the index key and the document IDs become the records, which makes it easy to find all documents containing a given word.
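A minimal in-memory sketch of the inverted-index idea in Java (class and method names here are made up for illustration; a real index would be built with MapReduce for large corpora):

```java
import java.util.*;

// Minimal in-memory inverted index: word -> {docId -> list of offsets}.
// Illustrative sketch only; real systems also store per-document term frequency.
public class InvertedIndex {
    private final Map<String, Map<Integer, List<Integer>>> index = new HashMap<>();

    public void addDocument(int docId, String text) {
        String[] words = text.toLowerCase().split("\\s+");
        for (int offset = 0; offset < words.length; offset++) {
            index.computeIfAbsent(words[offset], w -> new HashMap<>())
                 .computeIfAbsent(docId, d -> new ArrayList<>())
                 .add(offset); // word offset relative to the document start
        }
    }

    // Which documents contain the word, and at which positions?
    public Map<Integer, List<Integer>> lookup(String word) {
        return index.getOrDefault(word.toLowerCase(), Collections.emptyMap());
    }

    public static void main(String[] args) {
        InvertedIndex idx = new InvertedIndex();
        idx.addDocument(1, "this is cool");
        idx.addDocument(2, "this is a test");
        System.out.println(idx.lookup("this")); // {1=[0], 2=[0]}
    }
}
```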

1. What is Map/Reduce?

The core of big data processing is classification (map), then merging by category, going from complex to simple (reduce). Map classifies a dataset: the key is the category name, and the value is a count, a frequency, or a list of user IDs.

The merge (shuffle) step groups values with the same key, i.e. the same category, together. The Reducer then further merges and simplifies the values.

In between there is a sort phase, which leads naturally to merge sort and quick sort.

What is merge sort? What is quick sort? What are their relative advantages and disadvantages? I mainly said that merge sort is a stable sort and is convenient for sorting linked lists or, via k-way merge sort, for sorting external files; quick sort is an in-place sort and is fast in practice. Merge sort: divide and conquer, stable (preserves the input order of equal elements in the sorted output), external merge sort (tape drives), O(n log n) worst case, O(n) space, good for multithreading. Quick sort: divide and conquer, pivot-based, not stable, O(n^2) worst case, in place, faster with a good pivot.

  • Merge sort. Conceptually, a merge sort works as follows:

1. Divide the unsorted list into n sublists, each containing 1 element (a list of 1 element is considered sorted).
2. Repeatedly merge sublists to produce new sorted sublists until there is only 1 sublist remaining. This will be the sorted list.
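A compact in-memory reference sketch in Java (the external/k-way variants mentioned above work the same way, but merge sorted runs from disk):

```java
import java.util.Arrays;

public class MergeSort {
    // Classic top-down merge sort: O(n log n) worst case, O(n) extra space, stable.
    public static int[] sort(int[] a) {
        if (a.length <= 1) return a;                 // a 1-element list is sorted
        int mid = a.length / 2;
        int[] left  = sort(Arrays.copyOfRange(a, 0, mid));
        int[] right = sort(Arrays.copyOfRange(a, mid, a.length));
        return merge(left, right);
    }

    private static int[] merge(int[] l, int[] r) {
        int[] out = new int[l.length + r.length];
        int i = 0, j = 0, k = 0;
        while (i < l.length && j < r.length)
            out[k++] = (l[i] <= r[j]) ? l[i++] : r[j++]; // <= keeps the sort stable
        while (i < l.length) out[k++] = l[i++];
        while (j < r.length) out[k++] = r[j++];
        return out;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(sort(new int[]{5, 2, 8, 1, 9, 3})));
    }
}
```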

  • Quick sort. Quicksort is a divide and conquer algorithm. It first divides a large array into two smaller sub-arrays: the low elements and the high elements. It can then recursively sort the sub-arrays.

The steps are:

1. Pick an element, called a pivot, from the array.
2. Partitioning: reorder the array so that all elements with values less than the pivot come before it, while all elements with values greater than the pivot come after it (equal values can go either way). After this partitioning, the pivot is in its final position. This is called the partition operation.
3. Recursively apply the above steps to the sub-array of elements with smaller values and separately to the sub-array of elements with greater values.
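A matching in-place sketch in Java, using Lomuto partitioning with the last element as the pivot (one of several common pivot choices):

```java
import java.util.Arrays;

public class QuickSort {
    // In-place quicksort with Lomuto partitioning, last element as pivot.
    // O(n log n) average, O(n^2) worst case (e.g. already-sorted input), not stable.
    public static void sort(int[] a, int lo, int hi) {
        if (lo >= hi) return;
        int p = partition(a, lo, hi);   // pivot lands in its final position
        sort(a, lo, p - 1);             // recurse on the "smaller" side
        sort(a, p + 1, hi);             // recurse on the "greater" side
    }

    private static int partition(int[] a, int lo, int hi) {
        int pivot = a[hi], i = lo;
        for (int j = lo; j < hi; j++)
            if (a[j] < pivot) swap(a, i++, j);
        swap(a, i, hi);
        return i;
    }

    private static void swap(int[] a, int i, int j) {
        int t = a[i]; a[i] = a[j]; a[j] = t;
    }

    public static void main(String[] args) {
        int[] a = {5, 2, 8, 1, 9, 3};
        sort(a, 0, a.length - 1);
        System.out.println(Arrays.toString(a));
    }
}
```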

Example: WordCount

MapReduce walkthrough:
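The canonical Hadoop WordCount mapper and reducer, using the standard org.apache.hadoop.mapreduce API (job driver/setup code is omitted here; this requires the Hadoop client libraries on the classpath):

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
    // Map: for each word in the input line, emit (word, 1).
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: the shuffle groups values by word; sum the 1s to get the count.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum));
        }
    }
}
```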

Auto-complete

Step 1 is to build an N-Gram model.

Predict the next N-Gram based on the current N-Gram: "I love big" -> data course / brother / island around

Use Probability to predict next word/phrase

  • Let's build the N-Gram model
Steps:
  • Read a large-scale document collection
  • Build the n-gram library
  • Calculate probabilities
  • Run the project on MapReduce

Predict N-Gram based on 1-Gram

Build N-Gram Library

Let's use MR to solve this problem!

  • Mapper (all of the N-Gram splitting passes happen in the mapper)

N-Gram:

Splitting "this is cool" at each order (see the sketch below): 1-grams: this, is, cool; 2-grams: this is, is cool; 3-gram: this is cool.
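A small sketch of the splitting step in Java (names are illustrative; in the real job this logic lives inside the mapper's map() method):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class NGramSplitter {
    // Emit all n-grams of length 1..maxN from a sentence.
    // For "this is cool" with maxN = 3:
    //   1-grams: this, is, cool
    //   2-grams: this is, is cool
    //   3-gram : this is cool
    public static List<String> split(String sentence, int maxN) {
        String[] words = sentence.trim().toLowerCase().split("\\s+");
        List<String> ngrams = new ArrayList<>();
        for (int n = 1; n <= maxN; n++) {
            for (int i = 0; i + n <= words.length; i++) {
                ngrams.add(String.join(" ", Arrays.copyOfRange(words, i, i + n)));
            }
        }
        return ngrams;
    }

    public static void main(String[] args) {
        System.out.println(split("this is cool", 3));
    }
}
```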

Key? Starting word/phrase

Value? Following N-Gram with count

How to get the probability? Example counts:

this 1000
this is 500
this is a 125

P(is | this) = Count("this is") / Count("this") = 500/1000 = 0.5

The counts are keyed on the starting word/phrase, e.g. from "is cool since":

is cool 10 -> key = is, value = cool=10
cool since 18 -> key = cool, value = since=18

  • Reducer

Hint:

For a given phrase, store only the top n words with the highest probabilities. This value n should also be a command-line parameter to your MapReduce application. If two words have the same probability, choose the one that comes first lexicographically, i.e. 'ab' comes before 'bc'.
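A sketch of the reducer-side selection in plain Java (class and method names are hypothetical; in the actual job, n would be read from the command line via the job Configuration):

```java
import java.util.*;

// Keep only the top n following words per starting phrase,
// breaking probability ties in favor of the lexicographically earlier word.
public class TopNFollowers {
    public static List<Map.Entry<String, Double>> topN(
            Map<String, Double> followerProbs, int n) {
        // Min-heap with the "worst" candidate on top: lowest probability first;
        // on ties, lexicographically later first, so the earlier word survives.
        PriorityQueue<Map.Entry<String, Double>> heap = new PriorityQueue<>(
                Comparator.<Map.Entry<String, Double>>comparingDouble(Map.Entry::getValue)
                          .thenComparing(Map.Entry::getKey, Comparator.reverseOrder()));
        for (Map.Entry<String, Double> e : followerProbs.entrySet()) {
            heap.offer(e);
            if (heap.size() > n) heap.poll(); // evict the worst candidate
        }
        List<Map.Entry<String, Double>> result = new ArrayList<>(heap);
        result.sort(Map.Entry.<String, Double>comparingByValue().reversed());
        return result;
    }

    public static void main(String[] args) {
        Map<String, Double> probs = new HashMap<>();
        probs.put("data", 0.5); probs.put("brother", 0.3); probs.put("island", 0.3);
        System.out.println(topN(probs, 2)); // [data=0.5, brother=0.3]
    }
}
```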

Top K elements
1. Single node (order by frequency)
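A single-node sketch in Java: count frequencies in a HashMap, then keep a size-K min-heap so at most K candidates are held at once (names are illustrative):

```java
import java.util.*;

public class TopKSingleNode {
    public static List<String> topK(List<String> words, int k) {
        // 1) Count frequency of every word.
        Map<String, Integer> count = new HashMap<>();
        for (String w : words) count.merge(w, 1, Integer::sum);

        // 2) Min-heap of size k: the least frequent candidate sits on top.
        PriorityQueue<Map.Entry<String, Integer>> heap =
                new PriorityQueue<>(Comparator.comparingInt(Map.Entry::getValue));
        for (Map.Entry<String, Integer> e : count.entrySet()) {
            heap.offer(e);
            if (heap.size() > k) heap.poll(); // drop the least frequent so far
        }

        // 3) Drain the heap; reverse so the most frequent comes first.
        List<String> result = new ArrayList<>();
        while (!heap.isEmpty()) result.add(heap.poll().getKey());
        Collections.reverse(result);
        return result;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("a", "b", "a", "c", "a", "b");
        System.out.println(topK(words, 2)); // [a, b]
    }
}
```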

2. Top K on multiple nodes

Can a single node still handle, say, Weibo trending-search queries?

No: 1. the files are too large for a single machine to process; 2. processing is too slow.

How to implement Top K on multiple nodes (divide & combine):

  • Cut the big file into small files
  • Dispatch the small files to different machines (nodes)
  • Get the local top K from each machine
  • Compute the global top K from the combined local top Ks

Big file -> split into small files: slave 1 (small file) -> its local top K, slave 2 -> its local top K, slave 3 -> its local top K; the master node then combines these into the overall top K.
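A sketch of the combine step on the master node (hypothetical names; note that merging per-node results is exact only when each word's full count lives on a single node, which is what the MapReduce shuffle guarantees by partitioning on the key):

```java
import java.util.*;

// Divide & combine: each slave returns its local top-K counts;
// the master merges the partial counts and re-runs top K on the merged map.
public class TopKMerge {
    public static List<String> merge(List<Map<String, Integer>> localTopKs, int k) {
        Map<String, Integer> combined = new HashMap<>();
        for (Map<String, Integer> local : localTopKs)
            local.forEach((word, cnt) -> combined.merge(word, cnt, Integer::sum));

        PriorityQueue<Map.Entry<String, Integer>> heap =
                new PriorityQueue<>(Comparator.comparingInt(Map.Entry::getValue));
        for (Map.Entry<String, Integer> e : combined.entrySet()) {
            heap.offer(e);
            if (heap.size() > k) heap.poll();
        }
        List<String> result = new ArrayList<>();
        while (!heap.isEmpty()) result.add(heap.poll().getKey());
        Collections.reverse(result); // most frequent first
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> slave1 = Map.of("a", 10, "b", 7);
        Map<String, Integer> slave2 = Map.of("a", 3, "c", 9);
        System.out.println(merge(List.of(slave1, slave2), 2)); // [a, c]
    }
}
```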

What if data streams in and the top K must be updated in real time?

Three approaches. 1. Write new data to a disk file

  • when new data comes in, write it to a disk file
  • when a server requests the top K, run the algorithm over the disk file

The drawback: too slow!

2. Write new data to a hashmap

  • When new data comes in, write it to the hashmap
  • When the hashmap is updated, update the PQ at the same time

Drawbacks: runs out of memory; data is lost when a node fails or powers off.

3. Replacement for the HashMap:

  • store data in database
  • update counter in database

Use a TreeMap to replace the PQ because it keeps entries ordered by value (key the TreeMap by count), supports all the operations a PQ does, and additionally supports find and delete by key.
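A sketch of that replacement structure (illustrative names): a HashMap gives find/delete by key, and a TreeMap keyed by count keeps entries ordered by frequency:

```java
import java.util.*;

public class TreeMapTopK {
    private final Map<String, Integer> countOf = new HashMap<>();          // word -> count
    private final TreeMap<Integer, Set<String>> byCount = new TreeMap<>(); // count -> words

    public void add(String word) {
        Integer old = countOf.get(word);
        if (old != null) {                      // find-and-delete by key:
            Set<String> bucket = byCount.get(old); // remove the stale entry
            bucket.remove(word);
            if (bucket.isEmpty()) byCount.remove(old);
        }
        int now = (old == null) ? 1 : old + 1;
        countOf.put(word, now);
        byCount.computeIfAbsent(now, c -> new HashSet<>()).add(word);
    }

    public List<String> topK(int k) {
        List<String> result = new ArrayList<>();
        // Walk buckets from the highest count down until we have k words.
        for (Map.Entry<Integer, Set<String>> e : byCount.descendingMap().entrySet()) {
            for (String w : e.getValue()) {
                if (result.size() == k) return result;
                result.add(w);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        TreeMapTopK t = new TreeMapTopK();
        for (String w : new String[]{"a", "b", "a", "c", "a", "b"}) t.add(w);
        System.out.println(t.topK(2)); // [a, b]
    }
}
```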

Recommendation System Based on Matrices

1. Algorithms used for recommendation systems

● User Collaborative Filtering (User CF)

● Item Collaborative Filtering (Item CF)

User CF

  • A form of CF based on the similarity between users, calculated from those users' ratings of the same items.

Item CF

  • A form of CF based on the similarity between items, calculated from users' ratings of those items.
  • When the number of users far outweighs the number of items, use Item CF.
  • Items do not change frequently, which lowers the amount of calculation.
  • It relies on the user's own historical data, so the result is more convincing.
Steps:
  • Build the co-occurrence matrix
  • Build the rating matrix
  • Do the matrix computation to get the recommendation result

Step 1: describe the relationship between different items. We use the rating history to build the relationships between movies (if one user rated two movies, those two movies are related).

The rating matrix is built per user. For example, user B:

Movie   Rating (User B)
M1      3
M2      7
M3      8
M4      0
M5      0

  • Build the co-occurrence matrix (turning the user-movie matrix into a movie-movie matrix)
  • Build the rating matrix (one user's ratings over the movies they have rated)
  • Multiply the co-occurrence matrix by the rating matrix (each movie gets a score, so every unwatched movie receives an estimated score, i.e. its predicted rating; see the sketch below)
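A tiny illustration of the multiplication step in Java (the matrices are made up for illustration; production implementations typically normalize each co-occurrence row before multiplying):

```java
import java.util.Arrays;

// Multiply the movie x movie co-occurrence matrix by one user's rating vector.
// Unrated movies (rating 0) receive a predicted score.
public class RecommendByMatrix {
    public static double[] predict(int[][] cooccurrence, int[] ratings) {
        int n = ratings.length;
        double[] scores = new double[n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                // movie i's score accumulates "relatedness to j" * "rating of j"
                scores[i] += cooccurrence[i][j] * ratings[j];
            }
        }
        return scores;
    }

    public static void main(String[] args) {
        // cooccurrence[i][j] = how many users rated both movie i and movie j
        int[][] co = {
            {3, 2, 1}, // M1
            {2, 3, 2}, // M2
            {1, 2, 3}, // M3
        };
        int[] userB = {3, 7, 0}; // user B rated M1=3, M2=7, has not seen M3
        System.out.println(Arrays.toString(predict(co, userB)));
        // M3's score (1*3 + 2*7 + 3*0 = 17) is the predicted preference
    }
}
```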

Using the Mapper and Reducer well is the key. Mapper: parses userID, movieID, rating and emits (userID, movieID:rating).

Reducer (merge):

userID -> movieID:rating, movieID:rating, ... so for each user we collect which movies they watched and the corresponding ratings (see the sketch below).
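A sketch of this first job in Hadoop MapReduce (assuming input lines of the form "userID,movieID,rating"; the class names are illustrative):

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// First MR job of the pipeline: group each user's ratings together.
public class DataDividerByUser {
    public static class DataDividerMapper
            extends Mapper<LongWritable, Text, IntWritable, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] parts = value.toString().trim().split(",");
            int userID = Integer.parseInt(parts[0]);
            String movieID = parts[1];
            String rating = parts[2];
            // emit (userID, "movieID:rating")
            context.write(new IntWritable(userID), new Text(movieID + ":" + rating));
        }
    }

    public static class DataDividerReducer
            extends Reducer<IntWritable, Text, IntWritable, Text> {
        @Override
        protected void reduce(IntWritable key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            // merge: userID -> movieID1:rating1,movieID2:rating2,...
            StringBuilder sb = new StringBuilder();
            for (Text t : values) sb.append(',').append(t.toString());
            context.write(key, new Text(sb.substring(1)));
        }
    }
}
```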

[Slide residue: an empty user-movie rating matrix (users A-D x movies M1-M5) and an empty movie-movie co-occurrence matrix (M1-M5 x M1-M5).]
